Abstract - Future Information Retrieval (IR), especially in connection
with the internet, will incorporate content descriptions generated by
social network extraction technologies and will preferably use
probability theory to assign the semantics. Although interest in social
network extraction is increasing, little of this work has had a
significant impact on information retrieval. This paper therefore
proposes a model of information retrieval based on social network
extraction.

Keywords - singleton event; doubleton event; space of event; 
space of relation; logic; imaging; probability; Jaccard coefficient. 

I. Introduction 

Information Retrieval (IR) is concerned with answering information
needs as accurately as possible [1], [2]. The information is found in
ever-growing document collections such as web pages, for which
information extraction algorithms are being developed on a large scale
[3]. Social network extraction is a technology for identifying and
describing specific content, namely entities and their relations; that
is, it yields representations of web pages and queries that are
enriched with semantic information about their content [4], [5], [6],
[7]. However, little attention has been paid to applying social
networks to IR [8], although doing so would give rise to adapted and
more advanced retrieval models [9].

We believe that this technology will become an important part of IR,
with IR as an application in which the extracted social network
contributes to a more refined representation of web pages and queries.
Our goal in this paper is to define a model by the query and web page
representations and by the function that estimates the relevance of a
web page to a query based on pairwise relations among entities.

II. Related Work 

Well-known models of IR are the Boolean, vector space, probabilistic,
fuzzy and imaging models. These have been studied in detail and
implemented for experimental as well as commercial purposes.
Nevertheless, the known limitations of these models have led
researchers to propose new models. One such model is the logical model
for IR [10], [11], [12], [13]. In recent years there have been several
attempts to define a logic for IR along the so-called logical approach,
initiated by the pioneering work [1] and given decisive impulse by two
related works [14], [13]. Logical IR models were studied to provide a
rich and uniform representation of information and its semantics, with
the aim of improving retrieval effectiveness.

In line with those logic studies, there is formal research dealing with
imaging in IR [18]. This idea was explicitly proposed in [13] and
implemented in [11], [19], mainly to solve the uncertainty problem in
IR. These models enable more complex definitions of relevance than
other IR models. Using a similar approach, our model is also based on
logical uncertainty and probability theory, and it enhances IR by using
social network extraction. The output of the IR system is based on a
relevance score, so that documents can be sorted according to their
relevance to the query; future models, however, will incorporate the
content descriptions that are generated with IR and will preferably
incorporate the probabilistic nature of the assignment of semantics.

III. The Concept and Motivation 

Social network extraction is a web page (or document) based process
developed in the framework of modal relations. In the semantic web, one
research stream of social network extraction depends heavily on
co-occurrence as a modal relation, utilizing the Cartesian product for
clustering on the space of events [4], [6].



Definition 3.1 [20] Let $V \neq \emptyset$ be a set of nodes and $E$ a
set of edges. The social network extraction with the exertions $\xi$
and $\zeta$ for acquiring a rich and trusted social network is

$$SN = (V, E, A, R, Z, \xi, \zeta)$$

that satisfies the following conditions:

1) $\xi\ (1:1) : A \to V$, $v = \xi(a)$, $\forall a \in A\ \exists! v \in V$,
where $A$ is a set of actors.

2) $\zeta : R \to E$ so that $e_j = \zeta(r_k(a, b)) = \zeta(Z_a \cap Z_b)$,
$e_j \in E$, $r_k \in R$, $\forall a, b \in A$, $Z_a, Z_b, Z_a \cap Z_b \subseteq Z$,
where $Z$ is a set of attributes.

In Definition 3.1, the use of social network data in IR is motivated by
computing the importance of an actor in the social network extracted
from the web documents, and by using this relevance to compute the
importance of a web page. The relations among the social actors
(entities) also allow us to create and maintain an aggregate of closely
related web documents.

Information Retrieval is a knowledge technology concerned with the
effective and efficient retrieval of information for subsequent use by
interested parties. Information retrieval typically involves the
querying of unstructured or semi-structured information, the former
referring to the content of unstructured text (written or spoken),
images, video and audio, the latter referring to well-defined metadata
attached, in particular, to web pages. Like other technologies, IR can
also be formulated using logic [18]. Logical reasoning is essential for
defining paradigms and for understanding the phenomena in the space of
events. In classical logic, inference is often associated with logical
implication: a web page $\omega$ is relevant to a query $q$ if it
implies the query, or in other words, if the query can be inferred from
the web page, that is, if $\omega \Rightarrow q$ (read "if $\omega$ then
$q$") is true. A well-known paradigm of querying web pages is to input
key terms and match them against the terms by which the web pages are
indexed. In the case of a full-text search, the terms are the words of
the texts, and we define a term as follows.

Definition 3.2 A term $t_x$ consists of at least one word or a set of
words in a pattern, or $t_x = (w_1, \ldots, w_l)$, $l \le k$, where $k$
is the number of parameters representing a word $w$, $l$ is the number
of tokens (vocabularies) in $t_x$, and $|t_x| = k$ is the size of $t_x$.

We use the term for defining the singleton event as follows. 

Definition 3.3 Let $\Omega$ be a set of web pages indexed by a search
engine. For each search term $t_a$, $t_a \in \Sigma$, where $\Sigma$ is
the set of singleton search terms of the search engine, a vector space
$\Omega_a \subseteq \Omega$ is a singleton search engine event of web
pages (singleton event) that contain an occurrence (event) of
$t_a \in \omega$. The probability of a singleton event $\Omega_a$ is

$$P(t_a) = |\Omega_a| / |\Omega| \in [0, 1] \qquad (1)$$

where $|\Omega|$ is the cardinality of $\Omega$, and $|\Omega_a| \le |\Omega|$.
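As a minimal sketch of Eq. (1), the singleton event and its probability can be computed on a toy collection. The page identifiers, token sets and function names below are illustrative assumptions, and a real search engine would typically report hit counts rather than expose the pages themselves.

```python
from typing import Dict, Set


def singleton_event(pages: Dict[str, Set[str]], term: str) -> Set[str]:
    """Return Omega_a: the ids of the pages whose token set contains the term."""
    return {page_id for page_id, tokens in pages.items() if term in tokens}


def singleton_probability(pages: Dict[str, Set[str]], term: str) -> float:
    """Estimate P(t_a) = |Omega_a| / |Omega| as in Eq. (1)."""
    if not pages:
        raise ValueError("the collection Omega must not be empty")
    return len(singleton_event(pages, term)) / len(pages)


# Hypothetical toy collection Omega (an assumption, not data from the paper).
PAGES: Dict[str, Set[str]] = {
    "w1": {"alice", "bob", "retrieval"},
    "w2": {"alice", "network"},
    "w3": {"bob", "logic"},
    "w4": {"imaging", "logic"},
}

print(singleton_probability(PAGES, "alice"))  # 0.5
```

A term that occurs on no page yields probability 0, and a term that occurs on every page yields 1, so the estimate stays within [0, 1].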



Let $t_a$ represent an actor/entity name, and let $t_a$ be in the
query; then $\omega \Rightarrow t_a$. The implication
$\omega \Rightarrow t_a$ is an interpretation of the exertion $\xi$ in
Definition 3.1. However, logic by itself cannot fully model IR. In
determining the relevance of a web page to a query, the success or
failure of an implication, expressed as a true or false value, is not
enough. Although a web page consists of a set of statements
$\{s_i \mid i = 1, \ldots, m\}$, a collection of web pages cannot be
considered a consistent set of statements containing $t_a$. In fact,
the web pages in the collection could and often do contradict each
other in any particular logic, and not all the necessary knowledge is
available. In the case of name disambiguation,
$\Omega_t = \{\omega_i \mid i = 1, \ldots, l\}$ is a set of web pages
containing the names, where a name could and often does consist of
different patterns of name tokens (first/middle/last names, or
abbreviations). As the web grows, semantic relations such as synonymy
and polysemy mean that (a) different entities can share the same name,
and (b) a single entity can be designated by multiple names. Therefore,
the relationship between web pages and entities in logic is uncertain;
the degree of uncertainty is measured by $P(\omega \Rightarrow t_a)$ and
estimated by the conditional probability $P(t_a \mid \omega)$, a
conditional event $(t_a \cap \omega)/\omega$.

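For the singleton event, let $\Omega$ be the space of events where
$\omega \Rightarrow t_a$ is true, or

$$\Omega_a(t_a) = \begin{cases} 1 & \text{if } t_a \text{ is true at } \omega_a \in \Omega \\ 0 & \text{otherwise,} \end{cases}$$

but $\Omega_a(t_a) = 1$ also holds when $w_1, w_2, \ldots, w_l \in t_a$,
without their pattern (such as $t_a = (w_1, w_2, \ldots, w_l)$), are
true at any $\omega \in \Omega$. Each of the returned web pages may
contain much relevant information and even some irrelevant information.
Thus,

$$|\Omega_a| = \sum_{\Omega} (\Omega_a(t_a) = 1) \ge \sum_{\Omega} (\omega \Rightarrow t_a) \qquad (2)$$

where $\sum_{\Omega} (\Omega_a(t_a) = 1)$ is the number of web pages containing $t_a$.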
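The gap between pages that merely contain the words of a term and pages that contain the term as a pattern can be illustrated with a small sketch. The tokenized pages and helper names are hypothetical; word containment stands in for the search engine's hit test, and contiguous in-order matching is only one possible reading of "pattern".

```python
from typing import List, Sequence


def contains_all_words(page: Sequence[str], term: Sequence[str]) -> bool:
    """Bag-of-words test: every word of the term occurs somewhere in the page."""
    return set(term) <= set(page)


def contains_pattern(page: Sequence[str], term: Sequence[str]) -> bool:
    """Pattern test: the words of the term occur contiguously and in order."""
    k = len(term)
    return any(tuple(page[i:i + k]) == tuple(term)
               for i in range(len(page) - k + 1))


# Hypothetical tokenized pages (an assumption, not data from the paper).
PAGES: List[List[str]] = [
    ["john", "smith", "wrote", "the", "paper"],
    ["smith", "met", "john", "yesterday"],
    ["john", "doe", "attended"],
]
TERM = ("john", "smith")

# Pages counted as hits versus pages where the name pattern really occurs.
hit_count = sum(contains_all_words(page, TERM) for page in PAGES)
pattern_count = sum(contains_pattern(page, TERM) for page in PAGES)
print(hit_count, pattern_count)  # 2 1
```

The hit count can only be greater than or equal to the pattern count, which mirrors the inequality in Eq. (2).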



Therefore, to assemble the relevant information mentioned in web pages,
we can identify the most relevant information among that mentioned in
the top $n$ results based on one insight: how often the information is
mentioned across the top results also provides an important hint about
its relevance to the query. Here "relevant information" is meant in the
sense of a user of an IR system having an information need, with
relevance defined as logical consequence: the query is represented by a
statement and its negation, each with a premiss set and a minimal
premiss set, i.e., a statement is a logical consequence of a subset of
sentences, and a minimal premiss set is one that is as small as
possible, in the sense that the statement would no longer follow if any
of its members were deleted. This enables us to define a boundary
$\beta_a$ such that

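$$\sum_{\Omega} (\omega \Rightarrow t_a) \approx |\Omega_a| - \beta_a. \qquad (3)$$

or, based on Definition 3.1 and Definition 3.2, we have

$$v = \xi(a) = P\Big(\sum_{\Omega} (\omega \Rightarrow t_a)\Big) \approx (|\Omega_a| - \beta_a) / |\Omega| \qquad (4)$$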

This also establishes the following proposition.

Proposition 3.1 A web page is relevant to an entity information need,
$\omega \Rightarrow t_a$, if the web page contains at least one sentence
in which $t_a$ occurs under the name disambiguation condition
($\exists! s$, $t_a \in s$).
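The boundary-adjusted estimate of Eq. (4) can be sketched as follows. The paper does not state how the boundary is obtained, so `beta_a` is simply passed in as a hypothetical, externally supplied count of hits that do not refer to the intended entity; the function name and the numbers are likewise assumptions.

```python
def boundary_adjusted_probability(n_a: int, beta_a: int, n_total: int) -> float:
    """Return (|Omega_a| - beta_a) / |Omega|, the estimate in Eq. (4).

    n_a     -- |Omega_a|, number of pages returned for the term
    beta_a  -- boundary: assumed number of returned pages that are not relevant
    n_total -- |Omega|, number of indexed pages
    """
    if n_total <= 0:
        raise ValueError("|Omega| must be positive")
    if not 0 <= beta_a <= n_a <= n_total:
        raise ValueError("expected 0 <= beta_a <= |Omega_a| <= |Omega|")
    return (n_a - beta_a) / n_total


# Hypothetical counts: 120 hits, 20 of them judged to be about another entity.
print(boundary_adjusted_probability(120, 20, 1000))  # 0.1
```

With `beta_a = 0` the estimate reduces to the plain singleton probability of Eq. (1).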

To extract social networks from web pages, we define a doubleton event
to accompany the singleton event.

Definition 3.4 Let $\Omega$ be a set of web pages indexed by a search
engine. We assume $t_a$ and $t_b$ are search terms, $t_a \neq t_b$,
$t_a, t_b \in \Sigma$, where $\Sigma$ is the set of singleton search
terms of the search engine. A doubleton search term is
$\{\{t_a, t_b\} : t_a, t_b \in \Sigma\}$, and its vector space, denoted
by $\Omega_a \cap \Omega_b \subseteq \Omega$, is a doubleton search
engine event of web pages (doubleton event) that contain a
co-occurrence of $t_a$ and $t_b$ such that $t_a, t_b \in \omega_a$ and
$t_a, t_b \in \omega_b$.

The probability of a doubleton event $\Omega_a \cap \Omega_b$ is

$$P(t_a, t_b) = |\Omega_a \cap \Omega_b| / |\Omega| \in [0, 1] \qquad (5)$$

where $\Omega_a, \Omega_b, \Omega_a \cap \Omega_b \subseteq \Omega$.
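A minimal sketch of Eq. (5) on a toy collection follows. The pages, tokens and function names are illustrative assumptions; in practice the size of the intersection would usually come from the hit count of a conjunctive query.

```python
from typing import Dict, Set


def event(pages: Dict[str, Set[str]], term: str) -> Set[str]:
    """Return the ids of the pages that contain the term (a singleton event)."""
    return {page_id for page_id, tokens in pages.items() if term in tokens}


def doubleton_probability(pages: Dict[str, Set[str]],
                          term_a: str, term_b: str) -> float:
    """Estimate P(t_a, t_b) = |Omega_a & Omega_b| / |Omega| as in Eq. (5)."""
    if not pages:
        raise ValueError("the collection Omega must not be empty")
    if term_a == term_b:
        raise ValueError("a doubleton event requires t_a != t_b")
    co_occurrence = event(pages, term_a) & event(pages, term_b)
    return len(co_occurrence) / len(pages)


# Hypothetical toy collection Omega (an assumption, not data from the paper).
PAGES: Dict[str, Set[str]] = {
    "w1": {"alice", "bob", "retrieval"},
    "w2": {"alice", "network"},
    "w3": {"bob", "logic"},
    "w4": {"imaging", "logic"},
}

print(doubleton_probability(PAGES, "alice", "bob"))  # 0.25
```

Because the intersection is a subset of each singleton event, the doubleton probability never exceeds either singleton probability.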



In social network research, the top relevant web pages and the
relationships between web pages are leveraged to identify the most
relevant entities [6], or to generate a relationship between one entity
and another based on relevant web pages. This also indicates how close
one web page is to another.

Theorem 3.1 Web pages are relevant to an information need for the
social network if they contain at least one sentence that is relevant
to the relation between two entities.

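We borrow the use of logic in imaging [16] to prove this theorem. Let
$t_a \neq t_b$ be two terms for different entities, with
$\omega \Rightarrow t_a \wedge t_b$. Let $\Omega_a$ and $\Omega_b$ be
the spaces of events in which $\omega_a \Rightarrow t_a$ and
$\omega_b \Rightarrow t_b$ are true, respectively. Let $\Omega_a$ be
most similar to $\Omega_b$ where $t_a$ is true; then
$t_a \Rightarrow t_b$ will be true at $\Omega_b$ if and only if $t_b$
is true at $\Omega_a$. That is, with $\Omega_b(t_a) = 1$ if $t_a$ is
true at $\Omega_b$, we have

$$\Omega_b(t_a \Rightarrow t_b) = \Omega_a(t_b) \qquad (6)$$

where $\Omega_a(t_b) = 1$ if $t_b$ is true at $\Omega_a$. Similarly,
based on the symmetry of similarity, we obtain

$$\Omega_a(t_b \Rightarrow t_a) = \Omega_b(t_a). \qquad (7)$$

If $\omega_a \in \Omega_a$, $\omega_a \in \Omega_b$ and
$\omega_b \in \Omega_b$, $\omega_b \in \Omega_a$, then $t_a$ and $t_b$
co-occur in $\Omega$, or $\Omega_a \cap \Omega_b(t_a \Rightarrow t_b) = 1$,
where $\Omega_a, \Omega_b \subseteq \Omega$, and

$$\Omega_a \cap \Omega_b(t_a \Rightarrow t_b) = \Omega_a(t_a \Rightarrow t_b) \wedge \Omega_b(t_b \Rightarrow t_a) = 1 \wedge 1 = 1.$$

$\Omega_a \cap \Omega_b$ is the space of events if
$\omega_a \Rightarrow t_a \wedge t_b$ and
$\omega_b \Rightarrow t_a \wedge t_b$ are true.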

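Similar to the singleton event, based on Theorem 3.1 the doubleton
event is the space of events where $\omega \Rightarrow t_a \wedge t_b$
is true. Thus,

$$|\Omega_{ab}| = |\Omega_a \cap \Omega_b| = \sum_{\Omega} (\Omega_{ab}(t_a, t_b) = 1) \ge \sum_{\Omega} (\omega \Rightarrow t_a \wedge t_b), \qquad (8)$$

or with a boundary $\beta_{ab}$,

$$\sum_{\Omega} (\omega \Rightarrow t_a \wedge t_b) \approx |\Omega_a \cap \Omega_b| - \beta_{ab}, \qquad (9)$$

or from Definition 3.1 and Definition 3.4, we obtain

$$e_{ab} = \zeta(a, b) \qquad (10)$$

or in Jaccard coefficient

$$e_{ab} = \frac{|\Omega_a \cap \Omega_b| - \beta_{ab}}{|\Omega_a| + |\Omega_b| - |\Omega_a \cap \Omega_b| - \beta_a - \beta_b + \beta_{ab}} \qquad (11)$$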

Because $|\Omega_a \cap \Omega_b| \le |\Omega_a|$ or
$|\Omega_a \cap \Omega_b| \le |\Omega_b|$, $\beta_{ab}$ is less than or
equal to $\beta_a$ or $\beta_b$.
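The Jaccard form of the relation strength in Eq. (11) can be sketched directly from hit counts. The counts and boundary values below are hypothetical, the function name is an assumption, and returning 0.0 for a non-positive denominator is a design choice of this sketch rather than something the paper specifies.

```python
def jaccard_with_boundaries(n_a: int, n_b: int, n_ab: int,
                            beta_a: int = 0, beta_b: int = 0,
                            beta_ab: int = 0) -> float:
    """Relation strength e_ab following the Jaccard form of Eq. (11).

    n_a, n_b -- |Omega_a| and |Omega_b| (singleton hit counts)
    n_ab     -- |Omega_a & Omega_b| (doubleton hit count)
    beta_*   -- boundaries: assumed counts of non-relevant hits
    """
    numerator = n_ab - beta_ab
    denominator = n_a + n_b - n_ab - beta_a - beta_b + beta_ab
    if denominator <= 0:
        # Assumption of this sketch: no usable evidence means no relation.
        return 0.0
    return numerator / denominator


# Hypothetical counts for two entity names.
print(jaccard_with_boundaries(100, 80, 30))                         # 30 / 150
print(jaccard_with_boundaries(100, 80, 30,
                              beta_a=10, beta_b=5, beta_ab=5))      # 25 / 140
```

With all boundaries set to zero the function reduces to the ordinary Jaccard coefficient of the two singleton events.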



The doubleton event is an attempt to capture that two entities are
related when they are often mentioned in the same context, mainly
through an author-coauthor relationship in their academic papers. This
attempt is to be distinguished from looser ones such as, for instance,
the vector space model, in which web pages are ranked according to a
measure of similarity with the query. Of course, social network
extraction is based on a treatment of similarity, mainly to determine
the strength of relations operationally in a number of ways, and
similarity-based models of IR generally lack the theoretical soundness
of probabilistic models. However, IR is able to search efficiently
through huge amounts of data because it builds indexes from the web
pages (documents), which serve all kinds of information needs that are
very difficult to determine a priori. Moreover, the words of a query do
not always occur in relevant web pages. Thus, query expansion with
synonyms and related terms (terms that stand in some relation) is one
popular alternative, primarily enhancing the recall of the search
results, but Eqs. (10) and (11) mean that the relation is also
uncertain. Therefore, although a measure of similarity cannot be
directly interpreted as a probability, we can use probabilistic
inference to cope with the uncertainty of a relation.

IV. Model of IR 

A social network reflects a shift from individualism towards a social
structure, or a change from information about individuals to
information about relations, where the fundamental units are the
relations between entities. Relations are characterized by content,
direction and strength. The content of a relation refers to the
resource that is exchanged, a relation can be directed or undirected,
and relations also differ in strength.

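Suppose we have a set of possible relations $R$. Definition 3.3 means
that $t_a \Rightarrow \omega$, $\forall \omega \in \Omega_a$, such that
each $\omega$ connects to another in the singleton event through $t_a$.
In this case, the space of events $\Omega_a$ lies in a space of
relations $\rho$ (that is, there is a network of web pages) where $\xi$
generates $\rho$, and the degree of uncertainty can be measured by
Eq. (4). For all $\rho_x \in R$, let $\rho_x$ be the space of relations
tied to $\rho_y \in R$ where the space of events $\Omega_a$ is true;
then $\Omega_a \Rightarrow \Omega_b$ will be true at $\rho_y$ if and
only if $\Omega_b$ is true at $\rho_x$. With

$$\rho_y(\Omega_a) = \begin{cases} 1 & \text{if } \Omega_a \text{ is true at } \rho_y \\ 0 & \text{otherwise} \end{cases}$$

and

$$\rho_x(\Omega_b) = \begin{cases} 1 & \text{if } \Omega_b \text{ is true at } \rho_x \\ 0 & \text{otherwise} \end{cases}$$

we obtain

$$\rho_y(\Omega_a \Rightarrow \Omega_b) = \rho_x(\Omega_b). \qquad (12)$$

Similarly, if $\rho_x$ and $\rho_y$ are spaces of symmetric relations,
we have

$$\rho_x(\Omega_b \Rightarrow \Omega_a) = \rho_y(\Omega_a). \qquad (13)$$

From Eqs. (6) and (7), we obtain

$$\begin{aligned}
\rho_x(\Omega_b(t_a \Rightarrow t_b) \Rightarrow \Omega_a(t_b \Rightarrow t_a)) &= \rho_y(\Omega_a(t_b) \Rightarrow \Omega_b(t_a)) \\
&= \rho_y(\Omega_a \Rightarrow \Omega_b) \\
&= \rho_x(\Omega_b)
\end{aligned}$$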



This leads to the following statement.

Theorem 3.2 If a social network can be extracted from web pages (or a
collection of documents), then the web pages, as a space of events in
the space of relations, form a document network that represents the
document collection.

In the IR model, this theorem provides that
$\omega \Rightarrow t_a \wedge t_b$ is also
$t_a \wedge t_b \Rightarrow \omega$ in the space of relations
$\rho \in R$, where $\xi$ and $\zeta$ (social network extraction)
generate $\rho$, i.e., the first in web page networks and the second in
actor networks.

A treatment of probability in either the event space or the relation
space is based on a probability distribution over the set of possible
events $\Omega$ or the set of possible relations $R$, respectively,
with $\sum_{\Omega} P(\omega) = 1$ and $\sum_{R} P(\rho) = 1$. By
imaging we have

$$P(\omega \Rightarrow q) = P_\omega(q) = \sum_{\Omega} P(\omega)\,\Omega_\omega(q), \qquad (14)$$

where $\Omega_\omega(q) = 1$ if $q$ is true at $\Omega_\omega$, and
$\Omega_\omega(q) = 0$ otherwise, and

$$P(\rho \Rightarrow \omega) = P_\rho(\omega) = \sum_{R} P(\rho)\,R_\rho(\omega), \qquad (15)$$

where $R_\rho(\omega) = 1$ if $\omega$ is true at $R_\rho$, and
$R_\rho(\omega) = 0$ otherwise. The probability of a relation in a
social network, based on the uncertainty of Eqs. (3) and (8) as in
Eq. (11), depends on satisfying a threshold that is derived from the
boundaries. In an inference network, the truth value of a node depends
only upon the truth values of its parents. To evaluate the strength of
an inference chain going from one web page to the query, we set the web
page node $\omega_i$ to true and evaluate
$P(q_k = \text{true} \mid \omega_i = \text{true})$. This gives us an
estimate of $P(\omega_i \Rightarrow q_k)$, where
$P(\omega \Rightarrow \rho)$ is the document network that represents
the collection of web pages, such as the singleton events
$\Omega_a, \Omega_b \subseteq \Omega$ or the doubleton event
$\Omega_a \cap \Omega_b \subseteq \Omega$. While a query network is
built for each information need and can be modified and extended during
each session by the user in an interactive and dynamic way, in an
inference network we have

$$P(\rho \Rightarrow q) = P_\rho(q) = \sum_{R} P(\rho)\,R_\rho(q), \qquad (16)$$

where

$$R_\rho(q) = \begin{cases} 1 & \text{if } q \text{ is true at } R_\rho \\ 0 & \text{otherwise.} \end{cases}$$
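The weighted sums in Eqs. (14) to (16) all share one shape: a probability distribution multiplied by a 0/1 truth indicator and summed. The sketch below computes only that sum; it does not implement the probability transfer step of imaging, and the prior values, world identifiers and function name are hypothetical.

```python
import math
from typing import Callable, Dict


def probability_of_implication(prior: Dict[str, float],
                               holds_at: Callable[[str], bool]) -> float:
    """Sum the prior mass of the worlds (pages or relations) where q holds.

    prior    -- assumed distribution over worlds, must sum to 1
    holds_at -- truth indicator: True if the statement is true at that world
    """
    if not math.isclose(sum(prior.values()), 1.0):
        raise ValueError("the prior must sum to 1")
    return sum(mass for world, mass in prior.items() if holds_at(world))


# Hypothetical prior over four web pages and the pages where the query is true.
PRIOR = {"w1": 0.4, "w2": 0.3, "w3": 0.2, "w4": 0.1}
QUERY_TRUE_AT = {"w1", "w3"}

score = probability_of_implication(PRIOR, lambda world: world in QUERY_TRUE_AT)
print(round(score, 3))  # 0.6
```

The same function can be reused for Eq. (15) or Eq. (16) by passing a distribution over relations and the corresponding indicator.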

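We substitute Eqs. (15) and (16) into Eq. (14) as follows

$$\begin{aligned}
P(\omega \Rightarrow t_a \wedge t_b) &= P(\omega \Rightarrow q) \\
&= P((\rho \Rightarrow \omega) \Rightarrow (\rho \Rightarrow q)) \\
&= P(\rho \Rightarrow \omega) \Rightarrow P(\rho \Rightarrow q) \\
&= P_\rho(\omega) \Rightarrow P_\rho(q) \\
&= \sum_{R} P(\rho) R_\rho(\omega) \Rightarrow \sum_{R} P(\rho) R_\rho(q) \\
&= \sum_{R} P(\rho) R_\rho(q) \Rightarrow \sum_{R} P(\rho) R_\rho(\omega) \\
&= P_\rho(q) \Rightarrow P_\rho(\omega) \\
&= P(\rho \Rightarrow q) \Rightarrow P(\rho \Rightarrow \omega) \\
&= P((\rho \Rightarrow q) \Rightarrow (\rho \Rightarrow \omega)) \\
&= P(q \Rightarrow \omega) \\
&= P(t_a \wedge t_b \Rightarrow \omega)
\end{aligned}$$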



If $t_a, t_b$ are in $q$, then a web page $\omega$ is relevant to a
relation. Therefore, the evidence is that two or more terms $t_i$ or
relations $\rho_j$ together are relevant for the web pages. Different
combinations of terms in the query can be activated, and their
relevance can be computed on the social networks of their recorded
entities.

V. Conclusion and Future Work 

The social network will usually be manageable in terms of computational
complexity, with terms representing entities. It is also possible to
activate not just one candidate entity or passage when computing
relevance, but to consider a number of combinations of relations
between entities to be active and to compute the relevance of the set.
One can compute the event that two or more entities $t_i$,
$i = 1, \ldots, n$, or relations $\rho_j$, $j = 1, \ldots, m$, together
are relevant for the query. The terms of the entities can be linked to
different concepts, which, for instance, represent coreferring entities
or events. The relations can be extracted from different web pages, and
whether considering all possible web pages as an answer to the question
is computationally feasible remains open. Our near-future work is to
experiment further with the proposed method and to look into the
possibility of enhancing IR performance by using social networks.